Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[draft] Implement SOZip storage of terra targets #62

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

brownag
Copy link
Contributor

@brownag brownag commented Apr 29, 2024

This is a draft PR to implement an option (zipfile=TRUE) to write/read Seek-Optimized ZIP (SOZIP) files in the target store for #37. This works for tar_terra* SpatVector and SpatRaster methods,

Rather than attempting to create (non-SOZIP) ZIP files independently using utils::zip() or similar this PR uses one of two (simpler, but somewhat more limited) pathways available thru existing GDAL drivers.

  • Specific drivers that support direct write of SOZIP via a file extension for SOZIP (ESRI Shapefile and Geopackage, as of GDAL 3.7) use that path for writing. So, ESRI Shapefile uses .shz extension (as it already does), and GeoPackage uses .gpkg.zip. Read is done via /vsizip/...

  • All other drivers use "/vsizip/{path/to/zipfile/target}/target" generic data source path for write and read. This is not supported by all drivers currently, but works for things like Parquet and GeoTIFF. GeoTIFF requires specific GDAL options (STREAMABLE_OUTPUT=YES, COMPRESS=NONE).

Examples of the above cases have been added to tests.

Raster:

targets::tar_script({
    list(
        geotargets::tar_terra_rast(
            test_terra_rast2,
            terra::rast(system.file("ex/elev.tif", package = "terra")),
            gdal = c("STREAMABLE_OUTPUT=YES", "COMPRESS=NONE"),
            zipfile = TRUE
        ),
        geotargets::tar_terra_rast(
            test_terra_rast3,
            terra::rast(system.file("ex/elev.tif", package = "terra")),
            filetype = "GPKG",
            zipfile = TRUE
        )
    )
})
targets::tar_make()
#> ▶ dispatched target test_terra_rast2
#> ● completed target test_terra_rast2 [0.007 seconds]
#> ▶ dispatched target test_terra_rast3
#> ● completed target test_terra_rast3 [0.004 seconds]
#> ▶ ended pipeline [0.202 seconds]
#> Warning messages:
#> 1: /vsizip/{_targets/scratch/test_terra_rast2}/test_terra_rast2, band 1: Cannot modify metadata at that point in a streamed output file (GDAL error 6) 
#> 2: /vsizip/{_targets/scratch/test_terra_rast2}/test_terra_rast2, band 1: Cannot modify metadata at that point in a streamed output file (GDAL error 6) 
#> 3: /vsizip/{_targets/scratch/test_terra_rast2}/test_terra_rast2, band 1: Cannot modify metadata at that point in a streamed output file (GDAL error 6) 
#> 4: /vsizip/{_targets/scratch/test_terra_rast2}/test_terra_rast2, band 1: Cannot modify metadata at that point in a streamed output file (GDAL error 6)

targets::tar_read(test_terra_rast2)
#> class       : SpatRaster 
#> dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#> resolution  : 0.008333333, 0.008333333  (x, y)
#> extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> source      : test_terra_rast2} 
#> name        : elevation

targets::tar_read(test_terra_rast3)
#> class       : SpatRaster 
#> dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#> resolution  : 0.008333333, 0.008333333  (x, y)
#> extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> source      : test_terra_rast3} 
#> name        : Height 
#> min value   :    141 
#> max value   :    547

Vector:

targets::tar_script({
    lux_area <- function(projection = "EPSG:4326") {
        terra::project(
            terra::vect(system.file("ex", "lux.shp",
                                    package = "terra"
            )),
            projection
        )
    }
    list(
        geotargets::tar_terra_vect(
            test_terra_vect,
            lux_area()
        ),
        geotargets::tar_terra_vect(
            test_terra_vect_shz,
            lux_area(),
            filetype = "ESRI Shapefile"
        ),
        geotargets::tar_terra_vect(
            test_terra_vect_parquet_zip,
            lux_area(),
            filetype = "Parquet",
            zipfile = TRUE
        )
    )
})
targets::tar_make()
#> ▶ dispatched target test_terra_vect_parquet_zip
#> ● completed target test_terra_vect_parquet_zip [0.028 seconds]
#> ▶ dispatched target test_terra_vect
#> ● completed target test_terra_vect [0.054 seconds]
#> ▶ dispatched target test_terra_vect_shz
#> ● completed target test_terra_vect_shz [0.006 seconds]
#> ▶ ended pipeline [0.248 seconds]

targets::tar_read(test_terra_vect)
#>  class       : SpatVector 
#>  geometry    : polygons 
#>  dimensions  : 12, 6  (geometries, attributes)
#>  extent      : 5.74414, 6.528252, 49.44781, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : test_terra_vect
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1   NAME_1  ID_2   NAME_2  AREA   POP
#>  type        : <num>    <chr> <num>    <chr> <num> <int>
#>  values      :     1 Diekirch     1 Clervaux   312 18081
#>                    1 Diekirch     2 Diekirch   218 32543
#>                    1 Diekirch     3  Redange   259 18664

targets::tar_read(test_terra_vect_shz)
#>  class       : SpatVector 
#>  geometry    : polygons 
#>  dimensions  : 12, 6  (geometries, attributes)
#>  extent      : 5.74414, 6.528252, 49.44781, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : test_terra_vect_shz} (test_terra_vect_shz)
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1   NAME_1  ID_2   NAME_2  AREA   POP
#>  type        : <num>    <chr> <num>    <chr> <num> <int>
#>  values      :     1 Diekirch     1 Clervaux   312 18081
#>                    1 Diekirch     2 Diekirch   218 32543
#>                    1 Diekirch     3  Redange   259 18664

targets::tar_read(test_terra_vect_parquet_zip)
#>  class       : SpatVector 
#>  geometry    : polygons 
#>  dimensions  : 12, 6  (geometries, attributes)
#>  extent      : 5.74414, 6.528252, 49.44781, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : test_terra_vect_parquet_zip}
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1   NAME_1  ID_2   NAME_2  AREA   POP
#>  type        : <num>    <chr> <num>    <chr> <num> <int>
#>  values      :     1 Diekirch     1 Clervaux   312 18081
#>                    1 Diekirch     2 Diekirch   218 32543
#>                    1 Diekirch     3  Redange   259 18664

In the future, {gdalraster} (#48) could be used to assemble SOZip files--this would allow for sidecar files to be included (it appears they are not stored using /vsizip/ to write at this time, need to investigate), and drivers that do not support direct write of SOZip to be supported.

@brownag
Copy link
Contributor Author

brownag commented Apr 29, 2024

Apparently Parquet driver is not available in GDAL build being used on GHA, used FlatGeobuf instead which mostly works (appears to come back in different order; GDAL bug?)

Also apparently the snapshots have slight differences depending on runner. I don't have any snapshot changes to accept locally, so may need to relax these specific tests (as appears to have been done in other recent cases)


EDIT:
GDAL >=3.5 is required for (Geo)Parquet (tests with it fail on GHA ubuntu latest), and >=3.8 for the v1.0.0 specification. https://gdal.org/drivers/vector/parquet.html

GDAL >=3.1 is required for FlatGeobuf https://gdal.org/drivers/vector/flatgeobuf.html

@Aariq
Copy link
Collaborator

Aariq commented Oct 23, 2024

For other formats, a more "generic" write and read function could be used:

write_to_zip <- function(object, path) {
    #rename path to not be confused with fs::path() just to make more readable
    out_path <- path
    #do stuff in a fresh local tempdir() that disappears when function is done
    tmp <- withr::local_tempdir()
    dir_create(tmp, fs::path_dir(out_path))
    #write the raster (hard-coded options for demonstration)
    writeRaster(object,
                fs::path(tmp, out_path),
                filetype = "GTiff",
                overwrite = TRUE)
    #figure out which files got written
    raster_files <- dir_ls(path(tmp, path_dir(out_path)))
    #package those into a zip file using `zip::zip()`
    zip::zip(
        path(tmp, fs::path_file(out_path)),
        files = fs::path_file(raster_files),
        compression_level = 1,
        root = fs::path_dir(raster_files)
    )
    #move the zip file to the out_path as expected output
    file_move(path(tmp, fs::path_file(out_path)), out_path)
}

read_from_zip <- function(path) {
    tmp <- local_tempdir()
    #extract into tempdir
    zip::unzip(zipfile = path, exdir = tmp)
    #read in as rast
    rast(fs::path(tmp, fs::path_file(path)))
}

I'm not sure how these compare to SOZip in terms of read and write overhead, but I assume they are worse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants